-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Return local paths to Common Voice #3664
Conversation
if archive_iterator is not None: | ||
yield from self._generate_examples_streaming(archive_iterator, filepath, path_to_clips) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key part: if we get an archive files iterator, then use the new streaming logic, otherwise use the old pre-streaming logic.
Cool thanks for giving it a try @anton-l ! Would be very much in favor of having "real" paths to the audio files again for non-streaming use cases. At the same time it would be nice to make the audio data loading script as understandable as possible so that the community can easily add audio datasets in the future by looking at this one as an example. Think if it's clear for a contributor how to add an audio datasets script that works for the standard non-streaming case while it is easy to extend it afterwards to a streaming dataset script, then this would be perfect |
@anton-l @patrickvonplaten @lhoestq Is it possible somehow to provide this logic inside the library instead of a loading script so that we don't need to completely rewrite all the scripts for audio datasets and users don't have to care about two different loading approaches in the same script? 🤔 |
Not sure @lhoestq - what do you think? Now that we've corrected the previous resampling bug, think this one here is of high importance. @lhoestq - what do you think how we should proceed here? |
Yes let's do this :) Maybe we can change the behavior of In this case, a dataset would need to have something like this: for path, f in files:
yield id_, {"audio": {"path": path, "bytes": f.read() if not is_local_file(path) else None}} Alternatively, we can allow this if we consider that for path, f in files:
yield id_, {"audio": {"path": path, "bytes": f.read()}} Note that in this case the file is read for nothing though (maybe it's not a big deal ?) Let me know if it sounds good to you and what you'd prefer ! |
@lhoestq I'm very much in favor of your first aproach! With the full paths returned I think we won't even need to mess with |
@lhoestq I also like the idea and favor your first approach to avoid an unnecessary read and make yielding faster. |
Looks cool - thanks for working on this. I just feel strongly about |
Hi ! I started implementing this but I noticed that returning an absolute path is breaking for many datasets that do things like for path, f in files:
if path.startswith(data_dir):
... so I think I will have to add a parameter to This makes me also think that in streaming mode it could also return a local path too, if we think that writing and deleting temporary files on-the-fly while iterating over the streaming dataset is ok. |
@lhoestq I think it is a good idea to rollback to extracting the archives locally in non-streaming mode, as far as (as you mentioned) we do not store the bytes in the Arrow file for those cases to avoid "doubling" the disk space usage. On the other hand, I don't like:
Unfortunately, in order to fix the datasets that are breaking after the rollback, I would suggest fixing their scripts so that the paths are handled more robustly (considering that they can be absolute or relative). |
I agree with Albert, fixing all of the audio datasets isn't too big of a deal (yet). I can help with those if needed :) |
Ok cool ! I'm completely rolling it back then |
Alright I did the rollback and now you can get local paths :) |
I'll fix the CI tomorrow x) |
Ok according to the CI there around 60+ datasets to fix |
I can help with them too :) |
Here is the full list to keep track of things:
|
I'll do my best to fix as many as possible tomorrow :) |
the audio datasets are fixed if I didn't forget anything :) btw what are we gonna do with the community ones that would be broken after the fix? |
Closing in favor of #3736 |
Fixes #3663
This is a proposed way of returning the old local file-based generator while keeping the new streaming generator intact.
TODO: